The aim of this part of the project is to use occurrence data of bivalves from across the Cretaceous–Palaeogene boundary to reconstruct extinction prevalence and potential drivers of this.
renv::restore()
Renv, used above, sets up a project environment that stores the packages and versions used to make the project work. The idea of this is to enable you to go back to this after some time and get the project working again, no matter what’s changed or been updated in the meantime. Similarly, send the project to a collaborator and they can set things up to be working quickly and easily without having to search through packages to install. I’ve already initialized this project, to get things going on your computer should a matter of opening R in this filter. Renv will then install itself and any packages needed into its own library.
The following packages are used:
library(ncdf4)
library(tidyverse)
library(brms)
library(bayesplot)
library(CoordinateCleaner)
library(countrycode)
library(GeoRange)
library(jsonlite)
library(rgeos)
library(rnaturalearth)
library(rnaturalearthdata)
library(sf)
library(stars)
library(stringdist)
library(viridis)
library(ncdump) # needs to go after tidyverse I think
library(lattice)
options(mc.cores = parallel::detectCores())
rstan::rstan_options(auto_write = TRUE)
I’m rather opinionated about code and formatting. I think it’s generally better to be strict about the packages you use and how you write the code. I tend to use tidyverse style and formatting, which has a guide written by Hadley Wickham, who’s created some of the most used R packages. Google have a style guide for writing R code that I recommend following, but as a guide can and should be changed to meet your needs.
I use piping a lot. This means passing the result of one
function as the first argument of the next line. R has a native version
shown as %>%, but this is rather new (version 4.1.0).
tidyverse and package magrittr has had a different
version for while that looks like %>% and is probably
the better to use, so you might see that online more often.
This guide is written in R Markdown, a text format that combines text and code and can be run easily in R Studio and other environments. I also tend to use ggplot for plotting, so have a look here for the low-down on that package.
The data come from the Paleobiology Database (PBDB). Occurrences are downloaded though their API.
First we can download the data as a JSON file and save this to disc: this is our master data set. For this, I’ve gathered occurrences of Bivalvia from the Cretaceous and Palaeogene with all of the occurrence, geographical, systematic, and ecological data. This only need be done once (and I’ve included the file in the repository already).
bivalve_pbdb_url <-
paste0(
"https://paleobiodb.org/data1.2/", # PBDB API URL
"occs/", # occurrences
"list.json?", # download a JSON file
"base_name=Bivalvia&", # download bivalve occurrences
"interval=Cretaceous,Paleogene&", # temporal range to include
"show=full" # include full data
)
bivalve_pbdb_file <- paste0(Sys.Date(), "-bivalve_pbdb_occurrences.json")
download.file(bivalve_pbdb_url, bivalve_pbdb_file, method = "curl")
The data file can then be loaded into R, which gives a list with
length two: the first element ($elapsed_time) gives the
original download time and the second element ($records) is
a data.frame of the PBDB occurrence data. We keep just this
second part. To save space, I’ve downloaded a version and stored this as
a compressed tar file that needs to be expanded first with
untar.
bivalve_pbdb_file <- "2022-03-18-bivalve_pbdb_occurrences"
untar(paste0(bivalve_pbdb_file, ".tgz"))
bivalve_pbdb_data <-
jsonlite::fromJSON(paste0(bivalve_pbdb_file, ".json"))$records %>%
as_tibble() %>%
mutate(across(c(rnk, lng, lat), as.numeric))
We won’t need data from all the columns, but the important few are:
oid: occurrence ID, a unique reference number.idn: the originally identified name.tna: the currently accepted name.rnk: taxonomic rank of the accepted name.oei, oli: earliest and latest occurrence
interval names.
eag and lag are ages at the beginning and
end of these intervals, taken from the Geologic Time
Scale, but may want to update to a recent version.fml: family of the occurring taxon.
lng, lat: modern longitude and latitude of
the occurrence.
pln and pln are the reconstructed
palaeocoordinates, but it may be worth re-doing these to confirm or with
a different tectonic model.env: the palaeoenvironment, useful to check localities
and habitats, exclude freshwater species etc.jlh), general environment
(jev), feeding mode (jdt).jco: composition/mineralogy of shells.A full list of the fields is at the bottom of this page.
It’s a good check to have a quick look at the data to check its size and what different values are contained within. Most of the PBDB data are categorical or strings, which makes it a little more difficult (you can’t plot a bar chart for instance, except you can). A few things to check:
bivalve_pbdb_data %>%
filter(rnk == 3) %>%
group_by(tna) %>%
summarize(n())
## # A tibble: 3,935 × 2
## tna `n()`
## <chr> <int>
## 1 Abra (Abra) petropolitana 6
## 2 Abra (Syndosmya) cygnea 1
## 3 Abra (Syndosmya) splendens 4
## 4 Abra deshayesi 2
## 5 Abra nitens 40
## 6 Abra recluzii 1
## 7 Abra suessoniensis 2
## 8 Acanthocardia (Acanthocardia) reedi 2
## 9 Acanthocardia (Schedocardia) tuomeyi 31
## 10 Acanthocardia becksii 2
## # … with 3,925 more rows
Or a more fun thing to plot the occurrences from different families.
family_occurrences <-
bivalve_pbdb_data %>%
filter(rnk == c(3, 4, 5)) %>%
group_by(fml) %>%
summarize(family_occurrences = n())
family_occurrences %>%
ggplot(
aes(x = reorder(fml, desc(family_occurrences)), y = family_occurrences)
) +
geom_col()
family_occurrences %>%
slice_max(family_occurrences, n = 10)
## # A tibble: 12 × 2
## fml family_occurrences
## <chr> <int>
## 1 Veneridae 1593
## 2 Inoceramidae 1193
## 3 Ostreidae 1153
## 4 Gryphaeidae 1122
## 5 Corbulidae 991
## 6 Pectinidae 972
## 7 Cardiidae 971
## 8 Carditidae 859
## 9 Lucinidae 777
## 10 Nuculanidae 774
## 11 Radiolitidae 774
## 12 Tellinidae 774
Now we can do a more thorough look through the data to check that it’s correct, or at least in the form we expect it to be. Fortunately there is a handy package, CoordinateCleaner, that can help with this. It checks for coordinates that are present in the middle of countries or at (0, 0) — implying they have imprecise, or generic location data — or those occurrences that are spatiotemporal outliers — indicating they may not be correct identifications. Q guide through these checks is here.
This first step checks that the coordinates are included as numbers
and that there aren’t any potential errors — such as matching latitude
and longitude values. It’s recommended to also check which record, if
any, are flagged by adding value = "flagged" to all
commands.
occ_checking <-
bivalve_pbdb_data %>%
cc_val(lat = "lat", lon = "lng") %>%
cc_equ(lat = "lat", lon = "lng")
## Testing equal lat/lon
## Testing coordinate validity
## Removed 0 records.
## Removed 0 records.
Next we check for the location proximity to country centres, indicating imprecision.
occ_precision <-
occ_checking %>%
cc_cen(lat = "lat", lon = "lng", value = "flagged")
## Testing country centroids
## Flagged 203 records.
occ_precision_fl <-
bivalve_pbdb_data %>%
filter(!occ_precision)
Then we check whether occurrences are located within the country they are meant to come from. This requires using ISO 3-letter country codes, while the PBDB uses 2-letter codes, so some conversion is needed; it also uses UK while the country codes is GB and GBR.
# Add record for UK -> GBR, AA -> ATA [Antarctica]
cs_ma <- c("GBR", "ATA")
names(cs_ma) <- c("UK", "AA")
occ_checking <-
occ_checking %>%
mutate(
cc_iso3 = countrycode(
cc2,
origin = "iso2c",
destination = "iso3c",
custom_match = cs_ma
)
)
occ_country <-
occ_checking %>%
cc_coun(lat = "lat", lon = "lng", iso3 = "cc_iso3", value = "flagged")
## Testing country identity
## Flagged 6713 records.
occ_country_fl <-
bivalve_pbdb_data %>%
filter(!occ_country)
Fourth, is to check against host institutions on the Global Biodiversity Interchange Facility, in case the locations are listed as that rather than the in-the-field locality.
occ_inst <-
occ_checking %>%
cc_inst(lat = "lat", lon = "lng", value = "flagged")
## Testing biodiversity institutions
## Flagged 0 records.
occ_gbif <-
occ_checking %>%
cc_gbif(lat = "lat", lon = "lng", value = "flagged")
## Testing GBIF headquarters, flagging records around Copenhagen
## Flagged 5 records.
And finally, check for records at (0, 0).
occ_zero_zero <-
occ_checking %>%
cc_zero(lat = "lat", lon = "lng", value = "flagged")
## Testing zero coordinates
## Flagged 0 records.
Alongside check that occurrences are not at potentially imprecise/incorrect localities, also see whether any are erroneous separated in time is useful. First removing occurrences without ages.
occ_empty_ages <-
occ_checking %>%
filter(
!is.na(lag),
!is.na(eag)
) %>%
cf_equal(min_age = "lag", max_age = "eag", value = "flagged")
## Testing age validity
## Flagged 0 records.
Next, check on the precision of dating as the very imprecisely dated occurrences can be readily excluded.
bivalve_pbdb_data %>%
mutate(dating_precision = eag - lag) %>%
ggplot(aes(dating_precision)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histogram of Bivalvia occurrence dating ranges from the Paleobiology Database.
We’re mostly doing okay, with precision less than 20 Ma, but some extend all the way up to 140 Ma dates; we don’t want those!CoordinateCleaner also provides a way to clean based on range comparisons between taxa and comparing similarities — excluding occurrences based on the range of differences of dating precision rather than just setting an arbitrary cutoff. But an absolute value can sometimes be useful: in this case using 35 Ma as the length of the Late Cretaceous. If occurrences can’t certainly be dated within that time frame than we can readily exclude them. As a check of how removing these occurrences affects the spread of ranges, we plot the histogram again.
occ_ranges_total <-
occ_checking %>%
cf_range(taxon = "", min_age = "lag", max_age = "eag", value = "flagged")
## Testing temporal range outliers on dataset level
## Flagged 979 records.
occ_ranges_taxon <-
occ_checking %>%
cf_range(taxon = "tna", min_age = "lag", max_age = "eag", value = "flagged")
## Testing temporal range outliers on taxon level
## Flagged 2122 records.
occ_ranges_absolute <-
occ_checking %>%
cf_range(
taxon = "tna",
min_age = "lag", max_age = "eag",
method = "time", max_range = 35, value = "flagged"
)
## Testing temporal range outliers on taxon level
## Flagged 150 records.
bivalve_pbdb_data %>%
filter(occ_ranges_total & occ_ranges_taxon & occ_ranges_absolute) %>%
mutate(dating_precision = eag - lag) %>%
ggplot(aes(dating_precision)) +
geom_histogram(bins = 50)
Histogram of cleaned occurrence dating ranges.
Most of the occurrences in this data set can be dated within 10 Ma, so that’s good.
Now for checking outliers in time and space — those that are much separated from their main cluster and so may be a misidentification.
# this function exhausted memory, so I don't run it
# occ_outliers_total <-
# occ_checking %>%
# cf_outl(
# taxon = "",
# lat = "lat", lon = "lng",
# min_age = "lag", max_age = "eag",
# value = "flagged"
# )
occ_outliers_taxon <-
occ_checking %>%
cf_outl(
taxon = "tna",
lat = "lat", lon = "lng",
min_age = "lag", max_age = "eag",
value = "flagged"
)
## Testing spatio-temporal outliers on taxon level
## Flagged 873 records.
The above functions let you go through and check the removed
occurrences for the reasons and to make sure they should be removed.
This code below does all the cleaning automatically using the helper
function clean_fossils.
cs_ma <- c(
"UK" = "GBR",
"AA" = "ATA"
)
bivalve_pbdb_data <-
bivalve_pbdb_data %>%
mutate(
cc_iso3 = countrycode(cc2,
origin = "iso2c", destination = "iso3c", custom_match = cs_ma
)
)
bivalve_cleaned <-
bivalve_pbdb_data %>%
clean_fossils(
taxon = "tna",
min_age = "lag", max_age = "eag",
lon = "lng", lat = "lat",
countries = "cc_iso3", value = "clean"
) %>%
filter(str_detect(jev, "marine|shelf|oceanic"))
## Testing coordinate validity
## Flagged 0 records.
## Testing equal lat/lon
## Flagged 0 records.
## Testing zero coordinates
## Flagged 0 records.
## Testing country centroids
## Flagged 120 records.
## Testing spatio-temporal outliers on taxon level
## Warning in cf_outl(x = x, lon = lon, lat = lat, min_age = min_age, max_age
## = max_age, : decimallatitude not found. Using lat instead.
## Warning in cf_outl(x = x, lon = lon, lat = lat, min_age = min_age, max_age
## = max_age, : decimallongitude not found. Using lng instead.
## Flagged 884 records.
## Warning in cf_range(x = test, taxon = "", min_age = min_age, max_age =
## max_age, : decimallatitude not found. Using lat instead.
## Warning in cf_range(x = test, taxon = "", min_age = min_age, max_age =
## max_age, : decimallongitude not found. Using lng instead.
## Testing temporal range outliers on dataset level
## Flagged 979 records.
## Warning in cf_range(ran.test, taxon = taxon, min_age = min_age, max_age =
## max_age, : decimallatitude not found. Using lat instead.
## Warning in cf_range(ran.test, taxon = taxon, min_age = min_age, max_age =
## max_age, : decimallongitude not found. Using lng instead.
## Testing temporal range outliers on taxon level
## Flagged 2097 records.
## Testing age validity
## Flagged 0 records.
## Testing GBIF headquarters, flagging records around Copenhagen
## Flagged 5 records.
## Testing biodiversity institutions
## Flagged 0 records.
## Flagged 4002 of 74262 records, EQ = 0.05
That last line filters only those occurrences from
marine/shelf/oceanic environments, based on the lithology. Have a look
through the column and see whether you want to include any other
environments; separate them with the vertical bar — |,
meaning ‘or’ in regular expressions. The clean_fossils
function outputs useful messages to show what stage it’s at; I get some
warnings about decimallatitude and decimallongitude, but I don’t think
that’s anything to worry about.
A final set of things to check is for names that might have misspellings or missing taxonomy data.
One of the features of many names is including a subgenus name in parentheses. While taxonomically important, this can be problematic as some taxa will include the subgenus, while others may not depending on the taxonomy. We thus remove the subgenus name. At this point, it’s also useful to convert the strings to lowercase to remove any errors from capitalization. Searching and modifying strings in R uses a system ‘regular expressions’: a series of indicators to select certain groups of characters. In the example below1, we’re looking for a set of lowercase letters1, surrounded by parentheses and followed by a space. Have a look at this cheat sheet for stringr (there are cheat sheets for some other packages to look at here: https://www.rstudio.com/resources/cheatsheets/).
bivalve_cleaned <-
bivalve_cleaned %>%
mutate(tna_trimmed = str_remove_all(str_to_lower(tna), "\\(([:lower:]+)\\)\\s*"))
Next, we check for spelling differences. Here, we use the
distance between strings, in essence the number of characters
different, to check whether there are misspelled names. Package
stringdist has the function stringdistmatrix that
calculates the pairwise distances within a string vector, thus showing
the most similarly-spelled taxon names. Using an arbitrary distance
cutoff — in this case 4, but string_cutoff can be
changed — will show the most similar strings.
str_dist_matrix <-
bivalve_cleaned %>%
filter(rnk == 3) %>%
pull(tna_trimmed) %>%
unique() %>%
stringdistmatrix(useNames = "strings", method = "lcs") %>%
as.matrix()
string_cutoff <- 4
str_indices <-
tibble(
pos_n = which(str_dist_matrix <= string_cutoff & str_dist_matrix > 0),
row_n = pos_n %/% dim(str_dist_matrix)[2],
row_n_fixed = replace(row_n, row_n == 0, dim(str_dist_matrix)[2]) + 1, # need to add 1 to offset mod(1583) = 0
col_n = pos_n %% dim(str_dist_matrix)[1],
col_n_fixed = replace(col_n, col_n == 0, dim(str_dist_matrix)[1]),
row = rownames(str_dist_matrix)[row_n_fixed],
col = colnames(str_dist_matrix)[col_n_fixed],
distance = str_dist_matrix[pos_n]
) %>%
arrange(distance)
write_csv(str_indices, "./output/similar_spellings.csv")
At this point, we check the names against accepted names in WoRMS, LifeWatch, GBIF, or Treatise on Invertebrate Paleontology. Then, we use a conversion vector to correct these in the data: the incorrect version is before the equals and the corrected version after. Note this output table shows differences both ways (row and column), so each minor difference is shown twice with the names switched.
name_corrections <-
c(
"syncyclonema haggi" = "syncyclonema haeggi",
"incorrect" = "correct"
)
bivalve_cleaned <-
bivalve_cleaned %>%
mutate(tna_trimmed = str_replace_all(tna_trimmed, name_corrections))
We now use tna_trimmed as the main taxon
name-column.
We can check what families are in included in the data quite easily but pulling this column out and finding unique records.
bivalve_cleaned %>%
pull(fml) %>%
unique()
## [1] "Nuculidae" "Nuculanidae" "Pteriidae"
## [4] "Pectinidae" "Yoldiidae" "Glycymerididae"
## [7] "Pinnidae" "Limopsidae" "Arcidae"
## [10] "Mytilidae" "Entoliidae" "Anomiidae"
## [13] "Limidae" "Gryphaeidae" "Ostreidae"
## [16] "Flemingostreidae" "Solemyidae" "Antillocaprinidae"
## [19] "Bakevelliidae" "Plicatulidae" "Neitheidae"
## [22] "Pulvinitinae" "Dimyidae" "Monopleuridae"
## [25] "Spondylidae" "Malleidae" "Propeamussiidae"
## [28] "NO_FAMILY_SPECIFIED" "Radiolitidae" "Requieniidae"
## [31] "Hippuritidae" "Terquemiidae" "Trechmannellidae"
## [34] "Buchiidae" "Caprinidae" "Plagioptychidae"
## [37] "Oxytomidae" "Caprinuloideidae" "Heteropectinidae"
## [40] "Caprinulidae" "Epidiceratidae" "Polyconitidae"
## [43] "Ichthyosarcolitidae" NA "Philobryidae"
## [46] "Manzanellidae" "Palaeolophidae" "Diceratidae"
## [49] "Cardiidae" "Cassianellidae" "Posidoniidae"
## [52] "Pergamidiidae" "Chondrodontidae" "Aviculopectinidae"
## [55] "Pseudomonotidae" "Caprotinidae"
We see that there are two errant values: “NO_FAMILY_SPECIFIED” and
NA.
errant_taxonomy <-
bivalve_cleaned %>%
filter(fml == "NO_FAMILY_SPECIFIED" | is.na(fml)) %>%
select(tna_trimmed, fml, rnk)
print(errant_taxonomy, n = Inf)
## # A tibble: 387 × 3
## tna_trimmed fml rnk
## <chr> <chr> <dbl>
## 1 "entolium" NO_FAMILY_SPECIFIED 5
## 2 "entolium" NO_FAMILY_SPECIFIED 5
## 3 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 4 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 5 "entolium" NO_FAMILY_SPECIFIED 5
## 6 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 7 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 8 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 9 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 10 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 11 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 12 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 13 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 14 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 15 "entolium" NO_FAMILY_SPECIFIED 5
## 16 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 17 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 18 "entolium" NO_FAMILY_SPECIFIED 5
## 19 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 20 "entolium" NO_FAMILY_SPECIFIED 5
## 21 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 22 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 23 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 24 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 25 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 26 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 27 "balabania" NO_FAMILY_SPECIFIED 5
## 28 "balabania" NO_FAMILY_SPECIFIED 5
## 29 "balabania" NO_FAMILY_SPECIFIED 5
## 30 "balabania" NO_FAMILY_SPECIFIED 5
## 31 "dechaseauxia" NO_FAMILY_SPECIFIED 5
## 32 "hardaghia" NO_FAMILY_SPECIFIED 5
## 33 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 34 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 35 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 36 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 37 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 38 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 39 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 40 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 41 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 42 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 43 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 44 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 45 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 46 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 47 "entolium" NO_FAMILY_SPECIFIED 5
## 48 "entolium" NO_FAMILY_SPECIFIED 5
## 49 "entolium" NO_FAMILY_SPECIFIED 5
## 50 "entolium" NO_FAMILY_SPECIFIED 5
## 51 "entolium" NO_FAMILY_SPECIFIED 5
## 52 "entolium" NO_FAMILY_SPECIFIED 5
## 53 "entolium" NO_FAMILY_SPECIFIED 5
## 54 "entolium" NO_FAMILY_SPECIFIED 5
## 55 "entolium demissus" NO_FAMILY_SPECIFIED 3
## 56 "entolium" NO_FAMILY_SPECIFIED 5
## 57 "entolium demissus" NO_FAMILY_SPECIFIED 3
## 58 "entolium" NO_FAMILY_SPECIFIED 5
## 59 "entolium" NO_FAMILY_SPECIFIED 5
## 60 "entolium demissus" NO_FAMILY_SPECIFIED 3
## 61 "entolium" NO_FAMILY_SPECIFIED 5
## 62 "entolium" NO_FAMILY_SPECIFIED 5
## 63 "entolium" NO_FAMILY_SPECIFIED 5
## 64 "entolium" NO_FAMILY_SPECIFIED 5
## 65 "entolium" NO_FAMILY_SPECIFIED 5
## 66 "entolium" NO_FAMILY_SPECIFIED 5
## 67 "entolium demissus" NO_FAMILY_SPECIFIED 3
## 68 "entolium" NO_FAMILY_SPECIFIED 5
## 69 "entolium" NO_FAMILY_SPECIFIED 5
## 70 "entolium" NO_FAMILY_SPECIFIED 5
## 71 "entolium" NO_FAMILY_SPECIFIED 5
## 72 "entolium" NO_FAMILY_SPECIFIED 5
## 73 "entolium" NO_FAMILY_SPECIFIED 5
## 74 "entolium" NO_FAMILY_SPECIFIED 5
## 75 "entolium" NO_FAMILY_SPECIFIED 5
## 76 "entolium" NO_FAMILY_SPECIFIED 5
## 77 "entolium" NO_FAMILY_SPECIFIED 5
## 78 "entolium" NO_FAMILY_SPECIFIED 5
## 79 "entolium" NO_FAMILY_SPECIFIED 5
## 80 "entolium" NO_FAMILY_SPECIFIED 5
## 81 "entolium" NO_FAMILY_SPECIFIED 5
## 82 "entolium" NO_FAMILY_SPECIFIED 5
## 83 "entolium" NO_FAMILY_SPECIFIED 5
## 84 "entolium" NO_FAMILY_SPECIFIED 5
## 85 "entolium" NO_FAMILY_SPECIFIED 5
## 86 "entolium" NO_FAMILY_SPECIFIED 5
## 87 "entolium" NO_FAMILY_SPECIFIED 5
## 88 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 89 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 90 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 91 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 92 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 93 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 94 "entolium" NO_FAMILY_SPECIFIED 5
## 95 "entolium" NO_FAMILY_SPECIFIED 5
## 96 "ostreidina" <NA> 12
## 97 "ostreidina" <NA> 12
## 98 "megadiceras" NO_FAMILY_SPECIFIED 5
## 99 "megadiceras" NO_FAMILY_SPECIFIED 5
## 100 "hippuritacea" <NA> 10
## 101 "hippuritacea" <NA> 10
## 102 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 103 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 104 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 105 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 106 "hippuritida" <NA> 13
## 107 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 108 "entolium" NO_FAMILY_SPECIFIED 5
## 109 "entolium" NO_FAMILY_SPECIFIED 5
## 110 "entolium" NO_FAMILY_SPECIFIED 5
## 111 "entolium" NO_FAMILY_SPECIFIED 5
## 112 "entolium" NO_FAMILY_SPECIFIED 5
## 113 "entolium" NO_FAMILY_SPECIFIED 5
## 114 "entolium" NO_FAMILY_SPECIFIED 5
## 115 "entolium" NO_FAMILY_SPECIFIED 5
## 116 "entolium" NO_FAMILY_SPECIFIED 5
## 117 "entolium" NO_FAMILY_SPECIFIED 5
## 118 "entolium" NO_FAMILY_SPECIFIED 5
## 119 "entolium" NO_FAMILY_SPECIFIED 5
## 120 "entolium" NO_FAMILY_SPECIFIED 5
## 121 "entolium" NO_FAMILY_SPECIFIED 5
## 122 "entolium" NO_FAMILY_SPECIFIED 5
## 123 "entolium utukokense" NO_FAMILY_SPECIFIED 3
## 124 "entolium utukokense" NO_FAMILY_SPECIFIED 3
## 125 "entolium utukokense" NO_FAMILY_SPECIFIED 3
## 126 "entolium utukokense" NO_FAMILY_SPECIFIED 3
## 127 "hippuritida" <NA> 13
## 128 "entolium" NO_FAMILY_SPECIFIED 5
## 129 "ostreacea" <NA> 10
## 130 "entolium" NO_FAMILY_SPECIFIED 5
## 131 "entolium" NO_FAMILY_SPECIFIED 5
## 132 "entolium" NO_FAMILY_SPECIFIED 5
## 133 "entolium" NO_FAMILY_SPECIFIED 5
## 134 "entolium" NO_FAMILY_SPECIFIED 5
## 135 "hippuritacea" <NA> 10
## 136 "entolium" NO_FAMILY_SPECIFIED 5
## 137 "entolium" NO_FAMILY_SPECIFIED 5
## 138 "entolium" NO_FAMILY_SPECIFIED 5
## 139 "entolium" NO_FAMILY_SPECIFIED 5
## 140 "entolium" NO_FAMILY_SPECIFIED 5
## 141 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 142 "rhedensia" NO_FAMILY_SPECIFIED 5
## 143 "entolium" NO_FAMILY_SPECIFIED 5
## 144 "entolium" NO_FAMILY_SPECIFIED 5
## 145 "entolium" NO_FAMILY_SPECIFIED 5
## 146 "entolium" NO_FAMILY_SPECIFIED 5
## 147 "entolium" NO_FAMILY_SPECIFIED 5
## 148 "entolium" NO_FAMILY_SPECIFIED 5
## 149 "entolium" NO_FAMILY_SPECIFIED 5
## 150 "entolium" NO_FAMILY_SPECIFIED 5
## 151 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 152 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 153 "entolium" NO_FAMILY_SPECIFIED 5
## 154 "entolium" NO_FAMILY_SPECIFIED 5
## 155 "praecaprina" NO_FAMILY_SPECIFIED 5
## 156 "praecaprina" NO_FAMILY_SPECIFIED 5
## 157 "entolium demissus" NO_FAMILY_SPECIFIED 3
## 158 "entolium demissus" NO_FAMILY_SPECIFIED 3
## 159 "hippuritacea" <NA> 10
## 160 "entolium " NO_FAMILY_SPECIFIED 4
## 161 "ostreida" <NA> 13
## 162 "ostreida" <NA> 13
## 163 "ostreida" <NA> 13
## 164 "ostreida" <NA> 13
## 165 "entolium" NO_FAMILY_SPECIFIED 5
## 166 "entolium" NO_FAMILY_SPECIFIED 5
## 167 "entolium" NO_FAMILY_SPECIFIED 5
## 168 "entolium" NO_FAMILY_SPECIFIED 5
## 169 "entolium" NO_FAMILY_SPECIFIED 5
## 170 "entolium" NO_FAMILY_SPECIFIED 5
## 171 "entolium" NO_FAMILY_SPECIFIED 5
## 172 "entolium" NO_FAMILY_SPECIFIED 5
## 173 "entolium" NO_FAMILY_SPECIFIED 5
## 174 "entolium" NO_FAMILY_SPECIFIED 5
## 175 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 176 "tetravaccinites" NO_FAMILY_SPECIFIED 5
## 177 "pseudopironaea" NO_FAMILY_SPECIFIED 5
## 178 "entolium" NO_FAMILY_SPECIFIED 5
## 179 "entolium" NO_FAMILY_SPECIFIED 5
## 180 "hippuritida" <NA> 13
## 181 "hippuritida" <NA> 13
## 182 "hippuritida" <NA> 13
## 183 "hippuritida" <NA> 13
## 184 "entolium" NO_FAMILY_SPECIFIED 5
## 185 "entolium" NO_FAMILY_SPECIFIED 5
## 186 "hippuritida" <NA> 13
## 187 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 188 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 189 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 190 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 191 "hippuritida" <NA> 13
## 192 "pectinoidea" <NA> 10
## 193 "hippuritida" <NA> 13
## 194 "hippuritida" <NA> 13
## 195 "hippuritida" <NA> 13
## 196 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 197 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 198 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 199 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 200 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 201 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 202 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 203 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 204 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 205 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 206 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 207 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 208 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 209 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 210 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 211 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 212 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 213 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 214 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 215 "ostreoidea" <NA> 10
## 216 "ostreoidea" <NA> 10
## 217 "ostreoidea" <NA> 10
## 218 "ostreoidea" <NA> 10
## 219 "ostreoidea" <NA> 10
## 220 "ostreoidea" <NA> 10
## 221 "ostreoidea" <NA> 10
## 222 "ostreoidea" <NA> 10
## 223 "ostreoidea" <NA> 10
## 224 "ostreoidea" <NA> 10
## 225 "ostreoidea" <NA> 10
## 226 "ostreoidea" <NA> 10
## 227 "ostreoidea" <NA> 10
## 228 "ostreoidea" <NA> 10
## 229 "ostreoidea" <NA> 10
## 230 "ostreoidea" <NA> 10
## 231 "ostreoidea" <NA> 10
## 232 "ostreoidea" <NA> 10
## 233 "ostreoidea" <NA> 10
## 234 "hatayia" NO_FAMILY_SPECIFIED 5
## 235 "branislavia" NO_FAMILY_SPECIFIED 5
## 236 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 237 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 238 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 239 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 240 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 241 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 242 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 243 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 244 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 245 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 246 "entolium" NO_FAMILY_SPECIFIED 5
## 247 "entolium" NO_FAMILY_SPECIFIED 5
## 248 "entolium" NO_FAMILY_SPECIFIED 5
## 249 "entolium" NO_FAMILY_SPECIFIED 5
## 250 "entolium" NO_FAMILY_SPECIFIED 5
## 251 "entolium" NO_FAMILY_SPECIFIED 5
## 252 "entolium" NO_FAMILY_SPECIFIED 5
## 253 "entolium" NO_FAMILY_SPECIFIED 5
## 254 "entolium" NO_FAMILY_SPECIFIED 5
## 255 "entolium" NO_FAMILY_SPECIFIED 5
## 256 "entolium" NO_FAMILY_SPECIFIED 5
## 257 "entolium" NO_FAMILY_SPECIFIED 5
## 258 "entolium" NO_FAMILY_SPECIFIED 5
## 259 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 260 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 261 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 262 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 263 "entolium" NO_FAMILY_SPECIFIED 5
## 264 "entolium" NO_FAMILY_SPECIFIED 5
## 265 "entolium" NO_FAMILY_SPECIFIED 5
## 266 "entolium" NO_FAMILY_SPECIFIED 5
## 267 "entolium" NO_FAMILY_SPECIFIED 5
## 268 "entolium" NO_FAMILY_SPECIFIED 5
## 269 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 270 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 271 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 272 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 273 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 274 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 275 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 276 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 277 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 278 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 279 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 280 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 281 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 282 "entolium" NO_FAMILY_SPECIFIED 5
## 283 "entolium" NO_FAMILY_SPECIFIED 5
## 284 "hippuritida" <NA> 13
## 285 "hippuritida" <NA> 13
## 286 "entolium irenense" NO_FAMILY_SPECIFIED 3
## 287 "entolium irenense" NO_FAMILY_SPECIFIED 3
## 288 "hippuritida" <NA> 13
## 289 "hippuritacea" <NA> 10
## 290 "praecaprina" NO_FAMILY_SPECIFIED 5
## 291 "ostreida" <NA> 13
## 292 "ostreida" <NA> 13
## 293 "ostreida" <NA> 13
## 294 "ostreida" <NA> 13
## 295 "ostreida" <NA> 13
## 296 "ostreida" <NA> 13
## 297 "ostreida" <NA> 13
## 298 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 299 "hatayia" NO_FAMILY_SPECIFIED 5
## 300 "praecaprina" NO_FAMILY_SPECIFIED 5
## 301 "branislavia" NO_FAMILY_SPECIFIED 5
## 302 "ostreacea" <NA> 10
## 303 "entolium" NO_FAMILY_SPECIFIED 5
## 304 "entolium" NO_FAMILY_SPECIFIED 5
## 305 "entolium" NO_FAMILY_SPECIFIED 5
## 306 "entolium" NO_FAMILY_SPECIFIED 5
## 307 "ostreida" <NA> 13
## 308 "ostreida" <NA> 13
## 309 "entolium" NO_FAMILY_SPECIFIED 5
## 310 "cryptaulia" NO_FAMILY_SPECIFIED 5
## 311 "ostreida" <NA> 13
## 312 "branislavia" NO_FAMILY_SPECIFIED 5
## 313 "entolium" NO_FAMILY_SPECIFIED 5
## 314 "miseia" NO_FAMILY_SPECIFIED 5
## 315 "miseia" NO_FAMILY_SPECIFIED 5
## 316 "miseia" NO_FAMILY_SPECIFIED 5
## 317 "miseia" NO_FAMILY_SPECIFIED 5
## 318 "miseia" NO_FAMILY_SPECIFIED 5
## 319 "miseia" NO_FAMILY_SPECIFIED 5
## 320 "miseia" NO_FAMILY_SPECIFIED 5
## 321 "miseia" NO_FAMILY_SPECIFIED 5
## 322 "miseia" NO_FAMILY_SPECIFIED 5
## 323 "miseia" NO_FAMILY_SPECIFIED 5
## 324 "miseia" NO_FAMILY_SPECIFIED 5
## 325 "miseia" NO_FAMILY_SPECIFIED 5
## 326 "branislavia" NO_FAMILY_SPECIFIED 5
## 327 "miseia" NO_FAMILY_SPECIFIED 5
## 328 "miseia" NO_FAMILY_SPECIFIED 5
## 329 "branislavia" NO_FAMILY_SPECIFIED 5
## 330 "miseia" NO_FAMILY_SPECIFIED 5
## 331 "miseia" NO_FAMILY_SPECIFIED 5
## 332 "miseia" NO_FAMILY_SPECIFIED 5
## 333 "branislavia" NO_FAMILY_SPECIFIED 5
## 334 "hippuritida" <NA> 13
## 335 "hippuritida" <NA> 13
## 336 "hippuritida" <NA> 13
## 337 "hippuritida" <NA> 13
## 338 "hippuritida" <NA> 13
## 339 "hippuritida" <NA> 13
## 340 "hippuritida" <NA> 13
## 341 "hippuritida" <NA> 13
## 342 "hippuritida" <NA> 13
## 343 "hippuritida" <NA> 13
## 344 "hippuritida" <NA> 13
## 345 "tetravaccinites collignoni" NO_FAMILY_SPECIFIED 3
## 346 "pectinida" <NA> 13
## 347 "pectinida" <NA> 13
## 348 "pectinida" <NA> 13
## 349 "entolium" NO_FAMILY_SPECIFIED 5
## 350 "entolium" NO_FAMILY_SPECIFIED 5
## 351 "entolium" NO_FAMILY_SPECIFIED 5
## 352 "entolium" NO_FAMILY_SPECIFIED 5
## 353 "entolium" NO_FAMILY_SPECIFIED 5
## 354 "entolium" NO_FAMILY_SPECIFIED 5
## 355 "entolium" NO_FAMILY_SPECIFIED 5
## 356 "entolium" NO_FAMILY_SPECIFIED 5
## 357 "entolium" NO_FAMILY_SPECIFIED 5
## 358 "entolium" NO_FAMILY_SPECIFIED 5
## 359 "praecaprina" NO_FAMILY_SPECIFIED 5
## 360 "pectinida" <NA> 13
## 361 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 362 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 363 "entolium" NO_FAMILY_SPECIFIED 5
## 364 "entolium" NO_FAMILY_SPECIFIED 5
## 365 "entolium" NO_FAMILY_SPECIFIED 5
## 366 "entolium" NO_FAMILY_SPECIFIED 5
## 367 "entolium" NO_FAMILY_SPECIFIED 5
## 368 "entolium" NO_FAMILY_SPECIFIED 5
## 369 "entolium" NO_FAMILY_SPECIFIED 5
## 370 "hippuritida" <NA> 13
## 371 "ostreida" <NA> 13
## 372 "ostreida" <NA> 13
## 373 "entolium" NO_FAMILY_SPECIFIED 5
## 374 "entolium" NO_FAMILY_SPECIFIED 5
## 375 "entolium" NO_FAMILY_SPECIFIED 5
## 376 "entolium" NO_FAMILY_SPECIFIED 5
## 377 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 378 "entolium" NO_FAMILY_SPECIFIED 5
## 379 "entolium utukokense" NO_FAMILY_SPECIFIED 3
## 380 "pectinoidea" <NA> 10
## 381 "hippuritida" <NA> 13
## 382 "entolium utukokense" NO_FAMILY_SPECIFIED 3
## 383 "radiolitidina" <NA> 12
## 384 "ostreidina" <NA> 12
## 385 "ostreidina" <NA> 12
## 386 "ostreidina" <NA> 12
## 387 "pectinoidea" <NA> 10
errant_taxonomy %>%
group_by(rnk) %>%
summarize(n = n())
## # A tibble: 6 × 2
## rnk n
## <dbl> <int>
## 1 3 118
## 2 4 1
## 3 5 186
## 4 10 29
## 5 12 6
## 6 13 47
errant_taxonomy %>% pull(tna_trimmed) %>% unique()
## [1] "entolium" "entolium membranaceum"
## [3] "balabania" "dechaseauxia"
## [5] "hardaghia" "entolium orbiculare"
## [7] "entolium demissus" "ostreidina"
## [9] "megadiceras" "hippuritacea"
## [11] "hippuritida" "entolium utukokense"
## [13] "ostreacea" "rhedensia"
## [15] "praecaprina" "entolium "
## [17] "ostreida" "tetravaccinites"
## [19] "pseudopironaea" "pectinoidea"
## [21] "ostreoidea" "hatayia"
## [23] "branislavia" "entolium irenense"
## [25] "cryptaulia" "miseia"
## [27] "tetravaccinites collignoni" "pectinida"
## [29] "radiolitidina"
This finds 387 records that wither have no family or NA.
Most are identified to genus level, but a similar number are species. It
turns out that those with NA in the family are only
identified to a higher taxonomic level, so we can leave them alone.
The code below instead shows the lower taxa without a family specified. These are ripe for inclusion.
errant_families <-
bivalve_cleaned %>%
filter(fml == "NO_FAMILY_SPECIFIED") %>%
select(tna_trimmed, fml, rnk)
print(errant_families, n = Inf)
## # A tibble: 305 × 3
## tna_trimmed fml rnk
## <chr> <chr> <dbl>
## 1 "entolium" NO_FAMILY_SPECIFIED 5
## 2 "entolium" NO_FAMILY_SPECIFIED 5
## 3 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 4 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 5 "entolium" NO_FAMILY_SPECIFIED 5
## 6 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 7 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 8 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 9 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 10 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 11 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 12 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 13 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 14 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 15 "entolium" NO_FAMILY_SPECIFIED 5
## 16 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 17 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 18 "entolium" NO_FAMILY_SPECIFIED 5
## 19 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 20 "entolium" NO_FAMILY_SPECIFIED 5
## 21 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 22 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 23 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 24 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 25 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 26 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 27 "balabania" NO_FAMILY_SPECIFIED 5
## 28 "balabania" NO_FAMILY_SPECIFIED 5
## 29 "balabania" NO_FAMILY_SPECIFIED 5
## 30 "balabania" NO_FAMILY_SPECIFIED 5
## 31 "dechaseauxia" NO_FAMILY_SPECIFIED 5
## 32 "hardaghia" NO_FAMILY_SPECIFIED 5
## 33 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 34 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 35 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 36 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 37 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 38 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 39 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 40 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 41 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 42 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 43 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 44 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 45 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 46 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 47 "entolium" NO_FAMILY_SPECIFIED 5
## 48 "entolium" NO_FAMILY_SPECIFIED 5
## 49 "entolium" NO_FAMILY_SPECIFIED 5
## 50 "entolium" NO_FAMILY_SPECIFIED 5
## 51 "entolium" NO_FAMILY_SPECIFIED 5
## 52 "entolium" NO_FAMILY_SPECIFIED 5
## 53 "entolium" NO_FAMILY_SPECIFIED 5
## 54 "entolium" NO_FAMILY_SPECIFIED 5
## 55 "entolium demissus" NO_FAMILY_SPECIFIED 3
## 56 "entolium" NO_FAMILY_SPECIFIED 5
## 57 "entolium demissus" NO_FAMILY_SPECIFIED 3
## 58 "entolium" NO_FAMILY_SPECIFIED 5
## 59 "entolium" NO_FAMILY_SPECIFIED 5
## 60 "entolium demissus" NO_FAMILY_SPECIFIED 3
## 61 "entolium" NO_FAMILY_SPECIFIED 5
## 62 "entolium" NO_FAMILY_SPECIFIED 5
## 63 "entolium" NO_FAMILY_SPECIFIED 5
## 64 "entolium" NO_FAMILY_SPECIFIED 5
## 65 "entolium" NO_FAMILY_SPECIFIED 5
## 66 "entolium" NO_FAMILY_SPECIFIED 5
## 67 "entolium demissus" NO_FAMILY_SPECIFIED 3
## 68 "entolium" NO_FAMILY_SPECIFIED 5
## 69 "entolium" NO_FAMILY_SPECIFIED 5
## 70 "entolium" NO_FAMILY_SPECIFIED 5
## 71 "entolium" NO_FAMILY_SPECIFIED 5
## 72 "entolium" NO_FAMILY_SPECIFIED 5
## 73 "entolium" NO_FAMILY_SPECIFIED 5
## 74 "entolium" NO_FAMILY_SPECIFIED 5
## 75 "entolium" NO_FAMILY_SPECIFIED 5
## 76 "entolium" NO_FAMILY_SPECIFIED 5
## 77 "entolium" NO_FAMILY_SPECIFIED 5
## 78 "entolium" NO_FAMILY_SPECIFIED 5
## 79 "entolium" NO_FAMILY_SPECIFIED 5
## 80 "entolium" NO_FAMILY_SPECIFIED 5
## 81 "entolium" NO_FAMILY_SPECIFIED 5
## 82 "entolium" NO_FAMILY_SPECIFIED 5
## 83 "entolium" NO_FAMILY_SPECIFIED 5
## 84 "entolium" NO_FAMILY_SPECIFIED 5
## 85 "entolium" NO_FAMILY_SPECIFIED 5
## 86 "entolium" NO_FAMILY_SPECIFIED 5
## 87 "entolium" NO_FAMILY_SPECIFIED 5
## 88 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 89 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 90 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 91 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 92 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 93 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 94 "entolium" NO_FAMILY_SPECIFIED 5
## 95 "entolium" NO_FAMILY_SPECIFIED 5
## 96 "megadiceras" NO_FAMILY_SPECIFIED 5
## 97 "megadiceras" NO_FAMILY_SPECIFIED 5
## 98 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 99 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 100 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 101 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 102 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 103 "entolium" NO_FAMILY_SPECIFIED 5
## 104 "entolium" NO_FAMILY_SPECIFIED 5
## 105 "entolium" NO_FAMILY_SPECIFIED 5
## 106 "entolium" NO_FAMILY_SPECIFIED 5
## 107 "entolium" NO_FAMILY_SPECIFIED 5
## 108 "entolium" NO_FAMILY_SPECIFIED 5
## 109 "entolium" NO_FAMILY_SPECIFIED 5
## 110 "entolium" NO_FAMILY_SPECIFIED 5
## 111 "entolium" NO_FAMILY_SPECIFIED 5
## 112 "entolium" NO_FAMILY_SPECIFIED 5
## 113 "entolium" NO_FAMILY_SPECIFIED 5
## 114 "entolium" NO_FAMILY_SPECIFIED 5
## 115 "entolium" NO_FAMILY_SPECIFIED 5
## 116 "entolium" NO_FAMILY_SPECIFIED 5
## 117 "entolium" NO_FAMILY_SPECIFIED 5
## 118 "entolium utukokense" NO_FAMILY_SPECIFIED 3
## 119 "entolium utukokense" NO_FAMILY_SPECIFIED 3
## 120 "entolium utukokense" NO_FAMILY_SPECIFIED 3
## 121 "entolium utukokense" NO_FAMILY_SPECIFIED 3
## 122 "entolium" NO_FAMILY_SPECIFIED 5
## 123 "entolium" NO_FAMILY_SPECIFIED 5
## 124 "entolium" NO_FAMILY_SPECIFIED 5
## 125 "entolium" NO_FAMILY_SPECIFIED 5
## 126 "entolium" NO_FAMILY_SPECIFIED 5
## 127 "entolium" NO_FAMILY_SPECIFIED 5
## 128 "entolium" NO_FAMILY_SPECIFIED 5
## 129 "entolium" NO_FAMILY_SPECIFIED 5
## 130 "entolium" NO_FAMILY_SPECIFIED 5
## 131 "entolium" NO_FAMILY_SPECIFIED 5
## 132 "entolium" NO_FAMILY_SPECIFIED 5
## 133 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 134 "rhedensia" NO_FAMILY_SPECIFIED 5
## 135 "entolium" NO_FAMILY_SPECIFIED 5
## 136 "entolium" NO_FAMILY_SPECIFIED 5
## 137 "entolium" NO_FAMILY_SPECIFIED 5
## 138 "entolium" NO_FAMILY_SPECIFIED 5
## 139 "entolium" NO_FAMILY_SPECIFIED 5
## 140 "entolium" NO_FAMILY_SPECIFIED 5
## 141 "entolium" NO_FAMILY_SPECIFIED 5
## 142 "entolium" NO_FAMILY_SPECIFIED 5
## 143 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 144 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 145 "entolium" NO_FAMILY_SPECIFIED 5
## 146 "entolium" NO_FAMILY_SPECIFIED 5
## 147 "praecaprina" NO_FAMILY_SPECIFIED 5
## 148 "praecaprina" NO_FAMILY_SPECIFIED 5
## 149 "entolium demissus" NO_FAMILY_SPECIFIED 3
## 150 "entolium demissus" NO_FAMILY_SPECIFIED 3
## 151 "entolium " NO_FAMILY_SPECIFIED 4
## 152 "entolium" NO_FAMILY_SPECIFIED 5
## 153 "entolium" NO_FAMILY_SPECIFIED 5
## 154 "entolium" NO_FAMILY_SPECIFIED 5
## 155 "entolium" NO_FAMILY_SPECIFIED 5
## 156 "entolium" NO_FAMILY_SPECIFIED 5
## 157 "entolium" NO_FAMILY_SPECIFIED 5
## 158 "entolium" NO_FAMILY_SPECIFIED 5
## 159 "entolium" NO_FAMILY_SPECIFIED 5
## 160 "entolium" NO_FAMILY_SPECIFIED 5
## 161 "entolium" NO_FAMILY_SPECIFIED 5
## 162 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 163 "tetravaccinites" NO_FAMILY_SPECIFIED 5
## 164 "pseudopironaea" NO_FAMILY_SPECIFIED 5
## 165 "entolium" NO_FAMILY_SPECIFIED 5
## 166 "entolium" NO_FAMILY_SPECIFIED 5
## 167 "entolium" NO_FAMILY_SPECIFIED 5
## 168 "entolium" NO_FAMILY_SPECIFIED 5
## 169 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 170 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 171 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 172 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 173 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 174 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 175 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 176 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 177 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 178 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 179 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 180 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 181 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 182 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 183 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 184 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 185 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 186 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 187 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 188 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 189 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 190 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 191 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 192 "hatayia" NO_FAMILY_SPECIFIED 5
## 193 "branislavia" NO_FAMILY_SPECIFIED 5
## 194 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 195 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 196 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 197 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 198 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 199 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 200 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 201 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 202 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 203 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 204 "entolium" NO_FAMILY_SPECIFIED 5
## 205 "entolium" NO_FAMILY_SPECIFIED 5
## 206 "entolium" NO_FAMILY_SPECIFIED 5
## 207 "entolium" NO_FAMILY_SPECIFIED 5
## 208 "entolium" NO_FAMILY_SPECIFIED 5
## 209 "entolium" NO_FAMILY_SPECIFIED 5
## 210 "entolium" NO_FAMILY_SPECIFIED 5
## 211 "entolium" NO_FAMILY_SPECIFIED 5
## 212 "entolium" NO_FAMILY_SPECIFIED 5
## 213 "entolium" NO_FAMILY_SPECIFIED 5
## 214 "entolium" NO_FAMILY_SPECIFIED 5
## 215 "entolium" NO_FAMILY_SPECIFIED 5
## 216 "entolium" NO_FAMILY_SPECIFIED 5
## 217 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 218 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 219 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 220 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 221 "entolium" NO_FAMILY_SPECIFIED 5
## 222 "entolium" NO_FAMILY_SPECIFIED 5
## 223 "entolium" NO_FAMILY_SPECIFIED 5
## 224 "entolium" NO_FAMILY_SPECIFIED 5
## 225 "entolium" NO_FAMILY_SPECIFIED 5
## 226 "entolium" NO_FAMILY_SPECIFIED 5
## 227 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 228 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 229 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 230 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 231 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 232 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 233 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 234 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 235 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 236 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 237 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 238 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 239 "entolium membranaceum" NO_FAMILY_SPECIFIED 3
## 240 "entolium" NO_FAMILY_SPECIFIED 5
## 241 "entolium" NO_FAMILY_SPECIFIED 5
## 242 "entolium irenense" NO_FAMILY_SPECIFIED 3
## 243 "entolium irenense" NO_FAMILY_SPECIFIED 3
## 244 "praecaprina" NO_FAMILY_SPECIFIED 5
## 245 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 246 "hatayia" NO_FAMILY_SPECIFIED 5
## 247 "praecaprina" NO_FAMILY_SPECIFIED 5
## 248 "branislavia" NO_FAMILY_SPECIFIED 5
## 249 "entolium" NO_FAMILY_SPECIFIED 5
## 250 "entolium" NO_FAMILY_SPECIFIED 5
## 251 "entolium" NO_FAMILY_SPECIFIED 5
## 252 "entolium" NO_FAMILY_SPECIFIED 5
## 253 "entolium" NO_FAMILY_SPECIFIED 5
## 254 "cryptaulia" NO_FAMILY_SPECIFIED 5
## 255 "branislavia" NO_FAMILY_SPECIFIED 5
## 256 "entolium" NO_FAMILY_SPECIFIED 5
## 257 "miseia" NO_FAMILY_SPECIFIED 5
## 258 "miseia" NO_FAMILY_SPECIFIED 5
## 259 "miseia" NO_FAMILY_SPECIFIED 5
## 260 "miseia" NO_FAMILY_SPECIFIED 5
## 261 "miseia" NO_FAMILY_SPECIFIED 5
## 262 "miseia" NO_FAMILY_SPECIFIED 5
## 263 "miseia" NO_FAMILY_SPECIFIED 5
## 264 "miseia" NO_FAMILY_SPECIFIED 5
## 265 "miseia" NO_FAMILY_SPECIFIED 5
## 266 "miseia" NO_FAMILY_SPECIFIED 5
## 267 "miseia" NO_FAMILY_SPECIFIED 5
## 268 "miseia" NO_FAMILY_SPECIFIED 5
## 269 "branislavia" NO_FAMILY_SPECIFIED 5
## 270 "miseia" NO_FAMILY_SPECIFIED 5
## 271 "miseia" NO_FAMILY_SPECIFIED 5
## 272 "branislavia" NO_FAMILY_SPECIFIED 5
## 273 "miseia" NO_FAMILY_SPECIFIED 5
## 274 "miseia" NO_FAMILY_SPECIFIED 5
## 275 "miseia" NO_FAMILY_SPECIFIED 5
## 276 "branislavia" NO_FAMILY_SPECIFIED 5
## 277 "tetravaccinites collignoni" NO_FAMILY_SPECIFIED 3
## 278 "entolium" NO_FAMILY_SPECIFIED 5
## 279 "entolium" NO_FAMILY_SPECIFIED 5
## 280 "entolium" NO_FAMILY_SPECIFIED 5
## 281 "entolium" NO_FAMILY_SPECIFIED 5
## 282 "entolium" NO_FAMILY_SPECIFIED 5
## 283 "entolium" NO_FAMILY_SPECIFIED 5
## 284 "entolium" NO_FAMILY_SPECIFIED 5
## 285 "entolium" NO_FAMILY_SPECIFIED 5
## 286 "entolium" NO_FAMILY_SPECIFIED 5
## 287 "entolium" NO_FAMILY_SPECIFIED 5
## 288 "praecaprina" NO_FAMILY_SPECIFIED 5
## 289 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 290 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 291 "entolium" NO_FAMILY_SPECIFIED 5
## 292 "entolium" NO_FAMILY_SPECIFIED 5
## 293 "entolium" NO_FAMILY_SPECIFIED 5
## 294 "entolium" NO_FAMILY_SPECIFIED 5
## 295 "entolium" NO_FAMILY_SPECIFIED 5
## 296 "entolium" NO_FAMILY_SPECIFIED 5
## 297 "entolium" NO_FAMILY_SPECIFIED 5
## 298 "entolium" NO_FAMILY_SPECIFIED 5
## 299 "entolium" NO_FAMILY_SPECIFIED 5
## 300 "entolium" NO_FAMILY_SPECIFIED 5
## 301 "entolium" NO_FAMILY_SPECIFIED 5
## 302 "entolium orbiculare" NO_FAMILY_SPECIFIED 3
## 303 "entolium" NO_FAMILY_SPECIFIED 5
## 304 "entolium utukokense" NO_FAMILY_SPECIFIED 3
## 305 "entolium utukokense" NO_FAMILY_SPECIFIED 3
We can write this out to a file that can then be uploaded to LifeWatch and use that to compare against its databases.
errant_families %>%
select(tna_trimmed) %>%
unique() %>%
transmute(ScientificName = tna_trimmed) %>%
write_csv("./output/uncertain_taxa.csv")
This list of uncertain taxa can then be compared against an online data base, such as LifeWatch.
lifewatch_taxon_check <-
read_csv("./output/19470_uncertain_taxa.csv")
## New names:
## * NOTUSED_scientificname -> NOTUSED_scientificname...6
## * NOTUSED_scientificname -> NOTUSED_scientificname...15
## * NOTUSED_scientificname -> NOTUSED_scientificname...20
## * NOTUSED_scientificname -> NOTUSED_scientificname...26
## * NOTUSED_scientificname -> NOTUSED_scientificname...30
## * ...
## Rows: 20 Columns: 64
## ── Column specification ──────────────────────────────────────────────────
## Delimiter: ","
## chr (31): scientificname, note_fuzzy_col, environment_aphia_WORMS, nam...
## dbl (16): required_fields_check, aphiaid_WORMS, scientificnameid, taxo...
## lgl (17): datasetname_fuzzy_col, taxonomicstatus_fuzzy_col, taxonrank_...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(lifewatch_taxon_check)
## # A tibble: 20 × 64
## scientificname required_fields… datasetname_fuz… taxonomicstatus…
## <chr> <dbl> <lgl> <lgl>
## 1 Entolium 1 NA NA
## 2 Entolium membranace… 1 NA NA
## 3 Balabania 1 NA NA
## 4 Dechaseauxia 1 NA NA
## 5 Hardaghia 1 NA NA
## 6 Entolium orbiculare 1 NA NA
## 7 Entolium demissus 1 NA NA
## 8 Megadiceras 1 NA NA
## 9 Entolium utukokense 1 NA NA
## 10 Rhedensia 1 NA NA
## 11 Praecaprina 1 NA NA
## 12 Entolium (Entolium) 1 NA NA
## 13 Tetravaccinites 1 NA NA
## 14 Pseudopironaea 1 NA NA
## 15 Hatayia 1 NA NA
## 16 Branislavia 1 NA NA
## 17 Entolium irenense 1 NA NA
## 18 Cryptaulia 1 NA NA
## 19 Miseia 1 NA NA
## 20 Tetravaccinites col… 1 NA NA
## # … with 60 more variables: taxonrank_fuzzy_col <lgl>,
## # NOTUSED_scientificname...6 <lgl>, scientificnameid_fuzzy_col <lgl>,
## # isextinct_fuzzy_col <lgl>, match_type_fuzzy_col <lgl>,
## # match_count_fuzzy_col <lgl>, note_fuzzy_col <chr>,
## # environment_aphia_WORMS <chr>, name_aphia_WORMS <chr>,
## # aphiaid_WORMS <dbl>, NOTUSED_scientificname...15 <chr>,
## # scientificnameid <dbl>, status_aphia_WORMS <chr>, …
We can also do more fun things like plot the occurrences on a map. Useful guides to this are found in three blog posts starting here. Working with mapping data is more involved than simple points, particularly if you want to change the projection away from the standard Mercator. The package sf (short for ‘simple features’, describing how the data is stored) can deal with all of this and then in plotting the projection can be changed. A rule of thumb is make sure to assign the coordinate reference system (CRS) when creating map objects.
Package rnaturalearth has regularly updated maps that
should be better to use than in the built-in maps library.
Here we download the country outlines in sf format.
world_map <-
ne_countries(returnclass = "sf")
Next, convert the occurrences locations into an sf object
also. I’ve filtered down to occurrences identified to species level and
then taken a random sample (‘slice’) of 1000 rows. The final stage is
convert to an sf object using the longitude and latitude
columns, with the standard WGS84 CRS (indicated with 4326
in this case). You can find this code to use looking at epsg.io, searching for the projection,
and using the code given. In the case of WGS84 coordinates (standard
latitude and longitude) shown this is 4326.
occurrences_to_plot <-
bivalve_cleaned %>%
filter(rnk == 3) %>%
slice_sample(n = 1000) %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326, agr = "constant")
Now, we put these together into the plot using the
geom_sf function to build up the layers: map outline first,
then points over the top. The colour scale is good for colour-blind
people, so is a good option to prefer. Here I modify the map projection
using coord_sf(crs) into the Robinson
projection.
The essence of plotting with geom_sf is to add the
different layers one by one. In this case, the first call plots the
country outlines then the second line plots the occurrence points over
the top; scale_colour_viridis is a good set of plotting
colours to use as they are colour blind-friendly; and the final theme
options remove the border and the grey background usual in
ggplot plots.
ggplot() +
geom_sf(data = world_map, inherit.aes = FALSE) +
geom_sf(data = occurrences_to_plot, mapping = aes(colour = fml)) +
scale_colour_viridis(discrete = TRUE) +
coord_sf(crs = "+proj=robin") +
theme_bw() +
theme(
legend.position = "none",
panel.border = element_blank()
)
Map of Bivalvia localities from the Cretaceous and Palaeogene.
We use the palaeogeographical reconstructions from the combined Muller2019-Young2019-Cao2020 (MYC) model provided by GPlates (https://www.gplates.org) (Müller et al. 2019; Young et al. 2019; X. Cao et al. 2022; Torsvik et al. 2019). This provides a nice visualization of the palaeogeography including locations of shallow marine environments, land, mountains, and ice caps. We exported these four layers reconstructed to 68 Ma in the latest Cretaceous and plot them as a base map with modern coastlines indicated. The data files for the reconstructions are provided with GPlates or can be downloaded from Earthbyte.
The different layers (shallow marine, land, mountain, ice cap) are
included in four files in the
./data/palaeogeographical_reconstructions directory and
read into a list then combined using the functions below. Combining
these data sets into one makes it quicker to plot and colour the
different layers in the base map.
palaeogeo_file_layers <-
c("Shallow marine" = "sm", "Landmass" = "lm", "Mountain" = "m", "Ice cap" = "i")
read_geojson <-
function(prefix, filename = "./data/palaeogeographical_reconstructions/id_402_2_reconstructed_68.00Ma.gmt") {
# Read a list of GeoJSON files and combine into a single sf collection.
#
# Arguments:
# prefix: layer prefixes for the data files ("sm", "lm", "m", "i") taken from Cao et al. (2017).
# filename: rest of GeoJSON file name.
#
# Returns:
# A single feature collection with geometries and ID column ("id").
str_replace(filename, "id", prefix) %>%
st_read() %>%
add_column(id = prefix)
}
map_data <-
palaeogeo_file_layers %>%
purrr::map(read_geojson) %>%
do.call(rbind, .) %>%
mutate(
id = factor(id, levels = palaeogeo_file_layers, labels = names(palaeogeo_file_layers))
)
## Reading layer `sm_402_2_reconstructed_68.00Ma' from data source
## `/Users/glbcm/GitHub/kpg_extinction/data/palaeogeographical_reconstructions/sm_402_2_reconstructed_68.00Ma.gmt'
## using driver `OGR_GMT'
## Simple feature collection with 829 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -83.18105 xmax: 180 ymax: 89.52032
## Geodetic CRS: WGS 84
## Reading layer `lm_402_2_reconstructed_68.00Ma' from data source
## `/Users/glbcm/GitHub/kpg_extinction/data/palaeogeographical_reconstructions/lm_402_2_reconstructed_68.00Ma.gmt'
## using driver `OGR_GMT'
## Simple feature collection with 323 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -90 xmax: 180 ymax: 86.3155
## Geodetic CRS: WGS 84
## Reading layer `m_402_2_reconstructed_68.00Ma' from data source
## `/Users/glbcm/GitHub/kpg_extinction/data/palaeogeographical_reconstructions/m_402_2_reconstructed_68.00Ma.gmt'
## using driver `OGR_GMT'
## Simple feature collection with 312 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.50942 xmax: 180 ymax: 84.56318
## Geodetic CRS: WGS 84
## Reading layer `i_402_2_reconstructed_68.00Ma' from data source
## `/Users/glbcm/GitHub/kpg_extinction/data/palaeogeographical_reconstructions/i_402_2_reconstructed_68.00Ma.gmt'
## using driver `OGR_GMT'
## Simple feature collection with 4 features and 15 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 4.391382 ymin: -83.96268 xmax: 103.195 ymax: 22.29246
## Geodetic CRS: WGS 84
base_map <-
ggplot() +
geom_sf(data = map_data, aes(fill = id), colour = NA) +
scale_discrete_manual(
# colours for base map layers
values =
c(
"Ice cap" = "#DAD3FF",
"Landmass" = "#FFD23A",
"Mountain" = "#FF8D51",
"Shallow marine" = "#45D8FF"
),
aesthetics = c("fill"),
name = "Palaeogeography"
) +
theme_bw() +
theme(panel.border = element_blank())
modern_coastlines <-
st_read("./data/palaeogeographical_reconstructions/Global_EarthByte_GPlates_PresentDay_Coastlines_Polyline_reconstructed_68.00Ma.gmt")
## Reading layer `Global_EarthByte_GPlates_PresentDay_Coastlines_Polyline_reconstructed_68.00Ma' from data source `/Users/glbcm/GitHub/kpg_extinction/data/palaeogeographical_reconstructions/Global_EarthByte_GPlates_PresentDay_Coastlines_Polyline_reconstructed_68.00Ma.gmt'
## using driver `OGR_GMT'
## Simple feature collection with 1945 features and 23 fields
## Geometry type: MULTILINESTRING
## Dimension: XY
## Bounding box: xmin: -180 ymin: -87.88832 xmax: 180 ymax: 83.53398
## Geodetic CRS: WGS 84
base_map +
geom_sf(data = modern_coastlines, colour = "grey60") +
coord_sf(crs = "+proj=robin")
Palaeogeographical reconstruction in the latest Cretaceous (68 Ma), with modern coastlines indicated, based on the environmental reconstrution of W. Cao et al. (2017) and the combined rotaion models of Müller et al. (2019), Young et al. (2019), and X. Cao et al. (2022), with correction to the Pacific by Torsvik et al. (2019).
We have the modern localities for our bivalve occurrences, but need
to rotate these to their position at the end of the Cretaceous
(68 Ma) to get their palaeocoordinates. The GPlates Web Service can do
this online or through docker, but doesn’t offer the rotation model used
for the combined MYC palaeogeographical reconstruction, and can be quite
slow to do many requests — and we have tens-of-thousands of bivalve
occurrences. Instead, we field this out to pyGPlates.
This step of reconstruction has been done, so the file
./data/reconstructed_68Ma.gmt can be read in directly.
ek_occurrences <-
bivalve_cleaned %>%
filter(lag >= 66 & lag < 70)
modern_coords <-
ek_occurrences %>%
select(c("oid", "tna", "rnk", "eag", "lag", "fml", "lng", "lat", "tna_trimmed")) %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326)
st_write(modern_coords, "./data/kpg_bivalve_modern_coordinates.gmt", driver = "OGR_GMT")
We have to reduce the number of columns above as OGR_GMT
seems only to save the first 68.
Instructions to install pyGPlates can be found here,
including the necessary Python packages (specifically
proj). Anaconda/Miniconda is a
good Python managing system in which you can also create project
environments. The code in the python blocks below can be run in a python
environment — separate to the main R environment — or can be run
directly within R using the reticulate package (Ushey, Allaire, and Tang 2022).
There are two steps to reconstructing the palaeocoordinates: (1) assigning the localities to the static polygons (i.e. plate fragments that move with continental drift), and (2) rotating the plate fragments and localities to the palaeolocations. First we assign the points and output a GPML file that can be visualized in GPlates directly.
import pygplates
# Load one or more rotation files into a rotation model.
rotation_model = pygplates.RotationModel('./data/palaeogeographical_reconstructions/Muller2019-Young2019-Cao2020_CombinedRotations.rot')
# Filename for the static plate boundaries
static_polygons_filename = './data/palaeogeographical_reconstructions/Global_EarthByte_GPlates_PresentDay_StaticPlatePolygons.gpmlz'
# Load some features.
point_features = pygplates.FeatureCollection('./data/kpg_bivalve_modern_coordinates.gmt')
# Output filename for assigned points
assigned_points_output = './data/bivalve_assigned_locations.gpml'
# Place point occurrences onto plate static polygons
assigned_points = pygplates.partition_into_plates(
static_polygons_filename,
rotation_model,
point_features,
properties_to_copy = [
pygplates.PartitionProperty.reconstruction_plate_id,
pygplates.PartitionProperty.valid_time_period
]
)
# Write assigned points to a GPML file that can be used in GPlates
assigned_points_collection = pygplates.FeatureCollection(assigned_points)
assigned_points_collection.write(assigned_points_output)
Next we rotate the palaeolocations to 68 Ma and export an OGR-GMT file of these palaeocoordinates.
# Reconstruct features to this geological time.
reconstruction_time = 68
# The filename of the exported reconstructed geometries.
# It's a shapefile called 'reconstructed_68Ma.shp'.
export_filename = './data/reconstructed_{0}Ma.gmt'.format(reconstruction_time)
# Reconstruct the features to the reconstruction time and export them to a shapefile.
pygplates.reconstruct(assigned_points_collection, rotation_model, export_filename, reconstruction_time)
These converted locations can be read back into R and plotted over the base map.
palaeo_coords <-
st_read("./data/reconstructed_68Ma.gmt")
## Reading layer `reconstructed_68Ma' from data source
## `/Users/glbcm/GitHub/kpg_extinction/data/reconstructed_68Ma.gmt'
## using driver `OGR_GMT'
## Simple feature collection with 8974 features and 10 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -179.1851 ymin: -65.69038 xmax: 176.9147 ymax: 79.60964
## Geodetic CRS: WGS 84
palaeo_coords <-
palaeo_coords %>%
add_column(
plng = st_coordinates(palaeo_coords)[, 1],
plat = st_coordinates(palaeo_coords)[, 2]
)
base_map +
geom_sf(data = palaeo_coords) +
coord_sf(crs = "+proj=robin")
Palaeolocations of marine bivalve occurrences in the latest Cretaceous (70–66 Ma) plotted on a palaeogeographical map at 68 Ma.
We’ve got palaeoenvironmental data from an analysis done by Ridgwell and Schmidt (2010). The data is stored in a netCDF file: this is a compact way to store data on geographical scales where you have values at various locations. It’s different to the mapping data above: this netCDF is raster with data in points, whereas the maps above were vector and had lines of data. Ridgwell and Schmidt (2010) did a reconstruction of the palaeoenvironment at the K/Pg boundary using Biogem from the GeNie package, one of a series of Earth-systems models. There are two steps to getting data from a netCDF file into R: (1) loading the metadata, then (2) pulling the data you want into R. I get some warning about deprecated tidyverse functions, but I don’t think this is anything to worry about (yet). This page has a good guide to getting started with netCDF.
metadata <-
ncdump::NetCDF("./data/EXAMPLE.p0055c.RidgwellSchmidt2010.SPIN1/biogem/fields_biogem_3d.nc")
# look at the variables
print(metadata$variable, n = Inf)
## # A tibble: 48 × 16
## name ndims natts prec units longname group_index storage shuffle
## <chr> <int> <int> <chr> <chr> <chr> <int> <dbl> <lgl>
## 1 year 1 2 float "" year 1 2 FALSE
## 2 grid_level 2 5 int "n/a" grid de… 1 1 FALSE
## 3 grid_mask 2 4 float "n/a" land-se… 1 1 FALSE
## 4 grid_topo 2 4 float "m" ocean d… 1 1 FALSE
## 5 ocn_temp 4 4 float "deg… tempera… 1 2 FALSE
## 6 ocn_sal 4 4 float "PSU" salinity 1 2 FALSE
## 7 ocn_DIC 4 4 float "mol… dissolv… 1 2 FALSE
## 8 ocn_DIC_1… 4 4 float "o/o… d13C of… 1 2 FALSE
## 9 ocn_PO4 4 4 float "mol… dissolv… 1 2 FALSE
## 10 ocn_O2 4 4 float "mol… dissolv… 1 2 FALSE
## 11 ocn_ALK 4 4 float "mol… alkalin… 1 2 FALSE
## 12 ocn_DOM_C 4 4 float "mol… dissolv… 1 2 FALSE
## 13 ocn_DOM_C… 4 4 float "o/o… d13C of… 1 2 FALSE
## 14 ocn_DOM_P 4 4 float "mol… dissolv… 1 2 FALSE
## 15 ocn_Ca 4 4 float "mol… dissolv… 1 2 FALSE
## 16 ocn_SO4 4 4 float "mol… dissolv… 1 2 FALSE
## 17 ocn_H2S 4 4 float "mol… dissolv… 1 2 FALSE
## 18 ocn_Mg 4 4 float "mol… dissolv… 1 2 FALSE
## 19 carb_H 4 2 float "" carbona… 1 2 FALSE
## 20 carb_fug_… 4 2 float "" carbona… 1 2 FALSE
## 21 carb_conc… 4 2 float "" carbona… 1 2 FALSE
## 22 carb_conc… 4 2 float "" carbona… 1 2 FALSE
## 23 carb_conc… 4 2 float "" carbona… 1 2 FALSE
## 24 carb_ohm_… 4 2 float "" carbona… 1 2 FALSE
## 25 carb_ohm_… 4 2 float "" carbona… 1 2 FALSE
## 26 carb_dCO3… 4 2 float "" carbona… 1 2 FALSE
## 27 carb_dCO3… 4 2 float "" carbona… 1 2 FALSE
## 28 carb_RF0 4 2 float "" carbona… 1 2 FALSE
## 29 diag_NH4_… 4 3 float "mol… product… 1 2 FALSE
## 30 diag_NH4_… 4 3 float "mol… product… 1 2 FALSE
## 31 diag_NO3_… 4 3 float "mol… product… 1 2 FALSE
## 32 diag_NO3_… 4 3 float "mol… product… 1 2 FALSE
## 33 diag_SO4_… 4 3 float "mol… product… 1 2 FALSE
## 34 diag_SO4_… 4 3 float "mol… product… 1 2 FALSE
## 35 diag_dCH4 4 3 float "mol… product… 1 2 FALSE
## 36 phys_u 4 4 float "m/s" ocean v… 1 2 FALSE
## 37 phys_v 4 4 float "m/s" ocean v… 1 2 FALSE
## 38 phys_w 4 4 float "m/s" ocean v… 1 2 FALSE
## 39 misc_pH 4 3 float "pH … ocean pH 1 2 FALSE
## 40 misc_rCaC… 4 3 float "n/a" CaCO3 t… 1 2 FALSE
## 41 misc_frac… 4 3 float "n/a" POC fra… 1 2 FALSE
## 42 misc_frac… 4 3 float "n/a" CaCO3 f… 1 2 FALSE
## 43 misc_rho 4 3 float "kg … ocean d… 1 2 FALSE
## 44 grid_A 4 3 float "m2" ocean c… 1 2 FALSE
## 45 grid_dD 4 3 float "m2" ocean c… 1 2 FALSE
## 46 carb_d13C… 4 3 float "o/o… d13C of… 1 2 FALSE
## 47 carb_d13C… 4 3 float "o/o… d13C of… 1 2 FALSE
## 48 carb_d13C… 4 3 float "o/o… d13C of… 1 2 FALSE
## # … with 7 more variables: compression <lgl>, unlim <lgl>,
## # make_missing_value <lgl>, missval <dbl>, hasAddOffset <lgl>,
## # hasScaleFact <lgl>, id <dbl>
# load the desired variables
kpg_env <-
nc_open("./data/EXAMPLE.p0055c.RidgwellSchmidt2010.SPIN1/biogem/fields_biogem_3d.nc")
Loading the data gives us lots of lovely information. The important thing to note is that all the environment variables have values in fours components: longitude, latitude, depth, and time. The last of these is because GeNie runs for a set number of simulated years to reach a steady state. In this case it’s for 10,000 years, so we’ll want to use the values from year 10,000.
lon <-
ncvar_get(kpg_env, "lon")
lat <-
ncvar_get(kpg_env, "lat")
temp <-
ncvar_get(kpg_env, "ocn_temp")
dim(temp)
## [1] 36 36 16 13
The dimensions of the temperature values are 36, 36, 16, and 13. It’s a four-dimensional array with 36 values each for longitude and latitude, 16 depth values, and 13 time values. The ordering means that rows are longitude, columns are latitude, and depth and time are printed as sequential matrices. We can plot this quickly.
image(lon, lat, temp[,,1,13], asp = 1)
But his isn’t too useful and needs some cleaning up.
miss_val <- ncatt_get(kpg_env, "ocn_temp", "missing_value")
temp[temp == miss_val$value] <- NA
surface_temp <- temp[,,1,13]
grid = expand.grid(lon = lon, lat = lat)
grid$ocn_temp <- c(surface_temp)
temp_plot <-
st_as_stars(grid, dims = c("lon", "lat"), values = "ocn_temp", crs = 4326)
ggplot() +
geom_stars(data = temp_plot) + #, aes(x = lon, y = lat, fill = ocn_temp)) +
geom_sf(data = modern_coastlines, colour = "grey60") +
scale_fill_viridis(option = "inferno") +
theme_void()
You can see that the coastlines and temperatures don’t quite match up. The projections must be different. We can either reproject the coastlines onto the 2010 data or find some more recent temperature data using this outlines.
The last step here is to pull out the environmental values for each of the occurrences.
occ_coord <-
palaeo_coords %>%
as_data_frame() %>%
select(c("plng", "plat")) %>%
SpatialPoints()
temp_coord <- SpatialPoints(grid[complete.cases(grid),])
# from https://stackoverflow.com/a/27444208
nearest_points <-
apply(gDistance(temp_coord, occ_coord, byid = TRUE), 1, which.min)
The code above identifies the row number in the expanded temperature
data that corresponds to the location for each occurrence. We can then
extract and add that as a new column to palaeo_coords.
occ_temp <-
grid[complete.cases(grid),]$ocn_temp[nearest_points]
palaeo_coords <-
palaeo_coords %>%
add_column(pocn_temp = occ_temp)
We find taxa that go extinct by charting those present before the K/Pg boundary that aren’t present afterwards. Initially, do this for each species: we group into families later. Filtering out the species that go extinct, we choose those that have their last appearances after 70 Ma but before 66 Ma; you can be more precise and do after 68 Ma, but I don’t know whether the PBDB is dated precise enough to make much difference here.
species_extinctions <-
bivalve_cleaned %>%
filter(
rnk == 3,
lag >= 66 & lag <= 70
) %>%
group_by(tna_trimmed) %>%
summarize(kpg_ext = 1)
In the context of families, we can see the proportions of the families that go extinct. We choose species from the Maastrichtian and Dining, join on the previous table identifying those species that go extinct, then group into families and count the proportion that go extinction at the K/Pg boundary.
fml_extinction_proportion <-
bivalve_cleaned %>%
filter(
rnk == 3,
lag >= 61.2 & lag <= 72 & eag >= 66
) %>%
left_join(species_extinctions, by = "tna_trimmed") %>%
select(tna_trimmed, fml, kpg_ext) %>%
distinct(.keep_all = TRUE) %>%
mutate(kpg_ext = replace(kpg_ext, is.na(kpg_ext), 0)) %>%
group_by(fml, kpg_ext) %>%
summarize(n = n()) %>%
mutate(ext_prop = n / sum(n)) %>%
filter(kpg_ext == 1) # select the extinction proportion
## `summarise()` has grouped output by 'fml'. You can override using the
## `.groups` argument.
fml_extinction_proportion %>%
ggplot(aes(x = fml, y = ext_prop)) +
geom_col() +
coord_flip()
Proportions of bivalve families that go extinction across the Cretaceous/palaeogene boundary.
This last plot may also benefit from showing the number of species for each family (to gauge the relative magnitude) and another version showing the absolute numbers.
These last chunks essentially bring together most of the things we need to start modelling. These first few bits below will start with showing the general workflow for modelling extinctions across the Cretaceous–Palaeogene boundary and can then be readily extended to ask other or more specific questions. There are two levels to look at when modelling the extinction of these Bivalve taxa: (1) the family level, and (2) the species level.
You’ll be familiar with linear modelling, typified by the equation \(y = mx + c\). In our case, \(y\) is extinction — the value we’re trying to predict — while \(x\) is the predictor that we think has an effect, or drives extinction — temperature, area, salinity etc. In other language, extinction is the dependent variable while temperature, area, salinity are the independent variables. Additional complexity comes in because standard linear modelling assumes that the dependent variable is normally-distributed. What \(y = mx + c\) is shorthand for is this:
\[ y \sim \textrm{Normal} \left( \mu, \sigma \right) \\ \mu = mx + c, \]
where \(\mu\) is the mean and \(\sigma\) is a constant standard deviation. You’re predictors, \(x\) are use to work out the mean of a normal distribution, \(\mu\), which has a fixed standard deviation, \(\sigma\). Going through the normal distribution is where the ‘randomness’ inherent in the data comes from and accounts for the spread of the data about the best-fit line. Using this normal distribution model is fine for heights or lengths or other things like that, but we have two different examples:
This is where the extension of generalized linear modelling comes in: this relaxes the requirement to use a normal distribution for the dependent variable, so now we can predict proportions or discrete values.
Family extinctions we’ve calculated as the proportion of the family that goes extinction (between 0–1). Therefore, to model this we need to use a distribution that extends between 0–1, but not beyond that. This is prime territory for the beta distribution, so our model instead looks like this:
\[ E_i \sim \textrm{Beta} \left( \bar{p}_i, \theta \right) \\ \textrm{logit} \left( \bar{p}_i \right) = m_x x_i + c \\ m_x \sim \textrm{Normal} \left( 0, 10 \right) \\ c \sim \textrm{Normal} \left( 0, 10 \right) \\ \theta = \phi + 2 \\ \phi \sim \textrm{Normal} \left( 0, 10 \right), \]
where \(E\) is the proportion of the family that goes extinct, \(\bar{p}\) is the Beta distribution mean and \(\theta\) its shape parameter, \(x\) is the predictor, \(m\) and \(c\) are the coefficient and intercept of the linear model. For the Beta distribution, \(\theta\) determines the shape: whether the probabilities are condensed (e.g. 0.4–0.6) into the middle or focused around the edges (e.g. 0 or 1) with \(\theta = 2\) giving a flat distribution. Here we add the extra variable \(\phi\) so that \(\theta\) is centred around 2. The logit expression is needed to convert the ingoing model into a value between 0–1 as is required for \(\bar{p}\).
The additional feature here is that this is a more Bayesian realization of generalized linear modelling: the parameters in the model (\(\bar{p}, m_x, c, and \phi\)) all have probability distributions associated with them. In essence, rather that heading towards definite values, this model and these distributions incorporate uncertainty in measurements and models so that we end up with probabilities for a variety of different results, with the ‘best’ linear model having the highest probability. The model can’t tell you the exact level of extinction that a certain driver might generate (like with a very simplified linear model might suggest), but will give you the probability of different amounts of extinction, with one level hopefully being the most probable.
Most of the modelling can be done in R, but often is ‘slow’: adding complicated probability distributions is difficult, especially as you add more data and more complex models. This is added to R repeatedly having to keep checking between the data, working out what calculations it needs to do to get to the model values, then tell the computer to do those things. Instead we’ll use the package brms that uses an external language called Stan. Stan is designed specifically for generalized linear modelling and is also compiled: this means that it starts by taking the model and making a little program that can then skip past all the copying and checking R does and goes straight the computer and calculates. Fast.
The other feature of Stan, and one that’s often used when Bayesian inference is involved, is rather than sampling and calculating the probabilities directly, which is difficult, is instead to push some values through and calculate a result, then shift the parameters and calculate it again. By doing this repeatedly, you can then build up a picture of what the final distribution looks like without having to calculate it in full, complicated detail. This is called Monte Carlo sampling (MC) and often has some extra details called Metropolis Coupled Monte Carlo (MCMC).
brms is an R package that creates the Stan code from the data and model and tells Stan to run an MCMC analysis then parses the results ready to be plotted. It installs all the bits needed, so we don’t need to leave R to do these analyses.
This first model will use the family proportions of extinction and
will model this related to the amount of geographical space that the
family encompasses. This space will be calculated using a minimum
spanning tree (MST). An MST plots the shortest branching path
between all the points included in a set of data: the more spread out
the points, the longer the MST length. MSTDist in the chunk
below calculates the MST for each family at the K/Pg boundary, and the
results are joined to the proportions calculated earlier.
family_mst_data <-
palaeo_coords %>%
st_drop_geometry() %>%
group_by(fml) %>%
summarize(
n_occs = n(),
mst = MSTDist(plng, plat)$MST_km
) %>%
select(fml, n_occs, mst) %>%
left_join(fml_extinction_proportion, by = "fml")
family_mst_data %>%
ggplot(aes(x = fml, y = mst)) +
geom_col() +
coord_flip()
family_mst_data %>%
ggplot(aes(x = mst, y = ext_prop)) +
geom_point()
This last plot is a good one to check as it’s the linear model we want to test: extinction against MST length. There’s hints of a positive relationship but the data are certainly well spread.
A few rows have NA for their extinction values; these
are the families that don’t appear until after the K/Pg boundary. We’ll
remove these in a moment, but we’re nearly ready to go in into
modelling. The last bit will be to standardize the MST values: change
all the values so that they have a mean 0 and standard deviation of 1.
This reduces the effects of scale when doing the modelling; we will
convert back afterwards when we need to.
mst_scaling <-
scale(family_mst_data$mst)
family_mst_scaled <-
family_mst_data %>%
mutate(mst_stand = mst_scaling[, 1]) %>%
filter(!is.na(ext_prop)) %>%
select(fml, mst_stand, ext_prop)
Now we generate the linear model. This takes the form above, but I’ll replace a few of the parameter symbols to match that we’re using MST:
\[ E_i \sim \textrm{ZOIBeta} \left( \bar{p}_i, \phi \right) \\ \textrm{logit} \left( \bar{p}_i \right) = \alpha + \beta_M M_i \\ \beta_M \sim \textrm{Normal} \left(0, 10 \right) \\ \alpha \sim \textrm{Normal} \left( 0, 10 \right) \\ \phi \sim \textrm{Gamma} \left( 0.01, 0.01 \right), \]
where \(E_i\) is the proportion of the family that goes extinct, \(M_i\) is the scaled minimum spanning tree for the whole family, \(\alpha\) is the intercept, and \(\beta_M\) is the coefficient for \(M_i\) in the linear model. Here we use a Zero-One-Inflated Beta distribution (ZOIBeta) as the data has many proportions of 1.0 for our extinctions; these are likely a combination of total extinction and small numbers of species.
We now implement this in brms. This borrows the formula
notation used in base R linear models, but adds in the extras needed for
generalized linear modelling. The first thing that’s useful to do is to
show what parameters need to be defined initially with
get_priors.
get_prior(
ext_prop ~ 0 + mst_stand,
data = family_mst_scaled,
family = zero_one_inflated_beta()
)
## prior class coef group resp dpar nlpar lb ub
## (flat) b
## (flat) b mst_stand
## beta(1, 1) coi 0 1
## gamma(0.01, 0.01) phi 0
## beta(1, 1) zoi 0 1
## source
## default
## (vectorized)
## default
## default
## default
Here we see how the model is defined with
ext_prop ~ mst_stand, i.e. the extinction proportion is
related to the standardized MST length. Then we specify the data and the
family or distribution to be used in the model, in this case
the Beta distribution. This also includes the logit function needed. You
can see all the distributions and several useful examples for
brms in the vignettes and its website. A
list of the available families is here
or in R help (?brmsfamily).
brms is clever in setting out all the bits of the model we
need and providing sensible defaults. We can change these prior
distributions (i.e. those we start the model with) in the main
function using prior as you can see below. Running the
model itself uses the same form in the function brm: call
the formula, data, and family, then set any priors. Below we also setup
brm to sample from the prior, so we can see the before and
after, and then some housekeeping on how many jumps we want the MCMC
analysis to do. More jumps samples better but takes more time, and the
algorithm is clever so that it learns as it goes; thus after a while
adding more iterations (iter) doesn’t change the results
and are wasted. Getting there quicker is helped by having more
chains, which are essentially repeats of the analysis that can
be combined, and using more cores on your computer so that you
do several of these repeated analyses at the same time.
Run this code block and then we’ll see what the output looks like.
model_family_mst <-
brm(
ext_prop ~ mst_stand,
data = family_mst_scaled,
family = zero_one_inflated_beta(),
prior = prior(normal(0, 10), coef = "mst_stand", class = "b") +
prior(normal(0, 10), class = "Intercept"),
sample_prior = TRUE,
iter = 4000, chains = 4, cores = 4,
save_pars = save_pars()
)
## Compiling Stan program...
## Start sampling
Once the brm function is run, the Stan program is
compiled and then the analysis proceeds. If you’re running the same
analysis again then it shouldn’t require recompiling and goes straight
to the analysis.
summary(model_family_mst)
## Family: zero_one_inflated_beta
## Links: mu = logit; phi = identity; zoi = identity; coi = identity
## Formula: ext_prop ~ mst_stand
## Data: family_mst_scaled (Number of observations: 32)
## Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
## total post-warmup draws = 8000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.57 0.17 0.23 0.90 1.00 8459 5488
## mst_stand 0.31 0.18 -0.04 0.66 1.00 9120 5610
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## phi 10.08 3.17 4.90 17.20 1.00 8970 5813
## zoi 0.38 0.08 0.23 0.55 1.00 10927 6026
## coi 0.93 0.07 0.76 1.00 1.00 7157 3644
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
plot(model_family_mst)
pp_check(model_family_mst, ndraws = 100)
conditional_effects(model_family_mst)
The last line, conditional_effects plots the
relationship between the input and output, i.e. the ‘line of best fit’
between minimum spanning tree and family extinction proportion. This
will probably the most interesting and useful. You can use the argument
plot = FALSE to save the values separately.
Species extinctions are instead measured as a no/yes, coded as 0/1, so we instead use the binomial distribution generalized linear model. The binomial distribution shows the probability of success or failure for a given number of attempts (think coin tosses and getting heads or tails). In our case, the number of attempts is 1: a species can only go extinct once! (This makes it a special case of the binomial distribution called the Bernoulli distribution.) The parameter \(p\) then becomes the probability that the species will go extinct; this is what we’re trying to predict.
\[ E \sim \textrm{Bernoulli} \left( p_i \right) \\ \textrm{logit} \left( p_i \right) = \alpha + \beta_T T_i \\ \alpha \sim \textrm{Normal} \left( 0, 10 \right) \\ \beta_T \sim \textrm{Normal} \left( 0, 10 \right), \]
where \(E\) is whether a species goes extinct or not, \(p\) is the probability of extinction, \(T\) is the average temperature for that species, and \(\alpha\) and \(\beta\) are the coefficients.
In fact, we can go a step further and use all the occurrences rather than grouping by species and using the temperature at each occurrence as the input. This requires a multilevel or hierarchical model. Rather than grouping the family data (and summarizing with a mean value or similar), we instead input the individual occurrences and model each one and the trend across whole species. The model is the same except for the second line:
\[ \textrm{logit} \left( p_i \right) 1= \alpha + \beta_{T[species]} T_i, \]
where \(\beta_{T[species]}\) now indicates the coefficient on a per-species basis. We create our data.
species_temp_scaled <-
palaeo_coords %>%
st_drop_geometry() %>%
filter(rnk == 3) %>%
left_join(species_extinctions, by = "tna_trimmed") %>%
mutate(
tna_trimmed = as_factor(tna_trimmed),
pocn_temp_scaled = scale(pocn_temp)[, 1]
) %>%
select(tna_trimmed, pocn_temp_scaled, kpg_ext, fml, plng, plat)
Implementing this in brms is similarly not too difficult to do either:
get_prior(
kpg_ext ~ 1 + (pocn_temp_scaled | tna_trimmed),
data = species_temp_scaled,
family = binomial()
)
## prior class coef group resp dpar
## lkj(1) cor
## lkj(1) cor tna_trimmed
## student_t(3, 0, 2.5) Intercept
## student_t(3, 0, 2.5) sd
## student_t(3, 0, 2.5) sd tna_trimmed
## student_t(3, 0, 2.5) sd Intercept tna_trimmed
## student_t(3, 0, 2.5) sd pocn_temp_scaled tna_trimmed
## nlpar lb ub source
## default
## (vectorized)
## default
## 0 default
## 0 (vectorized)
## 0 (vectorized)
## 0 (vectorized)
The syntax (pocn_temp_scaled | tna_trimmed) estimates
the temperature coefficient for each species grouping. This creates a
correlation matrix with the model (lkj(1) in the priors).
We then run the model itself. This will be slower than the family-level
analysis because it is estimating much more from more data.
model_species_temp <-
brm(
# kpg_ext ~ 1 + (pocn_temp_scaled | tna_trimmed),
kpg_ext ~ 1 + (pocn_temp_scaled | tna_trimmed + fml),
data = species_temp_scaled,
family = bernoulli(),
sample_prior = TRUE,
iter = 4000, chains = 4, cores = 4,
save_pars = save_pars()
)
## Compiling Stan program...
## Start sampling
And we can summarize with these.
summary(model_species_temp)
## Family: bernoulli
## Links: mu = logit
## Formula: kpg_ext ~ 1 + (pocn_temp_scaled | tna_trimmed + fml)
## Data: species_temp_scaled (Number of observations: 3354)
## Draws: 4 chains, each with iter = 4000; warmup = 2000; thin = 1;
## total post-warmup draws = 8000
##
## Group-Level Effects:
## ~fml (Number of levels: 31)
## Estimate Est.Error l-95% CI u-95% CI Rhat
## sd(Intercept) 1.89 1.78 0.07 6.41 1.00
## sd(pocn_temp_scaled) 1.19 1.15 0.04 4.15 1.00
## cor(Intercept,pocn_temp_scaled) 0.02 0.58 -0.95 0.95 1.00
## Bulk_ESS Tail_ESS
## sd(Intercept) 5459 3633
## sd(pocn_temp_scaled) 4887 3807
## cor(Intercept,pocn_temp_scaled) 6788 3901
##
## ~tna_trimmed (Number of levels: 200)
## Estimate Est.Error l-95% CI u-95% CI Rhat
## sd(Intercept) 1.78 1.67 0.06 6.00 1.00
## sd(pocn_temp_scaled) 1.06 0.99 0.03 3.60 1.00
## cor(Intercept,pocn_temp_scaled) 0.02 0.58 -0.95 0.94 1.00
## Bulk_ESS Tail_ESS
## sd(Intercept) 6404 3929
## sd(pocn_temp_scaled) 5430 3196
## cor(Intercept,pocn_temp_scaled) 9980 4749
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 25.94 21.24 9.63 74.96 1.00 1482 816
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
plot(model_species_temp)
pp_check(model_species_temp, ndraws = 100)
It may not appear different to the family run above, but noticed under the summary that there are 200 levels in the group-level effects: these are the coefficients for each species.
It may also be worth grouping by family too with
kpg_ext ~ 1 + (pocn_temp_scaled | tna_trimmed | fml) or
kpg_ext ~ 1 + (pocn_temp_scaled | tna_trimmed + fml). Have
a look at this this
document and ?brmsformula for a description of the
formula syntax that brms uses.
The summary plots above give some good starting formation, but more detailed plotting is needed to fully understand the output. This is where the bayesplot package comes in: it holds many useful functions for easily plotting the output of brms and other Stan outputs. Have a look at this link for more examples.
Posterior intervals are probably the most useful outputs. These show the range of the output coefficients and their mean values. In other words, these are the outputs we’d want to see to tell whether there is a strong relationship between out predictor and response. The relative size can also tell whether certain species might be more strongly affected. In each case1, we first pull out the results, or draws from the model object, then use a function from bayesplot to plot the results.
posterior <- as_draws_array(model_species_temp)
mcmc_intervals(posterior, pars = vars(!matches("prior|lp__|b_|sd_|cor_")))
ggsave(file = "./fig/model-species-family-intervals.pdf", height = 80, limitsize = FALSE)
## Saving 7 x 80 in image
The parameters are estimated for all 200 or so species, hence the not
very useful plot. Also, because some of the values are large, the
coefficients are compressed. The desired parameters can be specified
with the pars argument; parnames gives a list
of the parameters to choose from.
#parnames(model_species_temp)
mcmc_areas(
posterior,
pars = vars(contains("arca"), starts_with("b_Intercept"))
)
Some amount of correlation will come because the occurrences are all
next to each other and the temperatures vary across that. We can add an
extra term gp to the formula to describe this
autocorrelation based on the palaeocoordinates. This describes
a Gaussian process (and a multivariate one at that), which
builds in the autocorrelation between the occurrence positions. The
implementation below is inspired by this
one.
In this case the Gaussian process we want works on the distances between each occurrence: closer occurrences have more similar values with the model (often on the standard deviation). Thus, we give the Gaussian process the palaeocoordinates; brms then generates the correlation structure that Stan uses in the model and MCMC.
Warning, this will be slow! (Multiple hours to complete.) I suggest playing with the models above then thinking about using this one later. When using multiple chains, the time taken for each chain can also be very different as the starting points are randomized.
model_species_temp_autocorr <-
brm(
kpg_ext ~ 1 + (pocn_temp_scaled | tna_trimmed + fml) + gp(plng, plat, scale = FALSE),
data = species_temp_scaled,
family = bernoulli(),
sample_prior = TRUE,
iter = 100, chains = 1, cores = 1,
save_pars = save_pars()
)
Another source of covariation is phylogeny, which I haven’t done yet but can include later.